Conversation

@motjuste
Contributor

@motjuste motjuste commented Mar 14, 2025

Description

dss now supports running on Canonical Kubernetes instead of microk8s. This support is currently available in channels for version 1.1. This PR adds support for testing dss on Canonical Kubernetes.

Updates to the provider

  • The install-deps script now accepts an argument to install Canonical Kubernetes instead of microk8s. It still installs microk8s by default.
  • install-deps now also installs the helm snap, which is used to enable NVIDIA GPU support in both Kubernetes variants.
  • There's now a single k8s_gpu_setup.py that can be used to set up GPUs from both Intel and NVIDIA on both microk8s and Canonical Kubernetes (a rough sketch of the unified flow follows this list).
    • Enabling Intel GPUs remains largely the same, and is arguably simpler now that kubectl apply with -k is used directly. Support for setting a specific number of slots-per-GPU has been removed, as it is not relevant to testing DSS at the moment.
    • helm is used to enable NVIDIA GPU support in the Kubernetes cluster, roughly following this guide.
    • The script detects microk8s and automatically applies the relevant containerd customisation.
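
As a rough illustration of the unified flow (the helper names, arguments, and defaults below are assumptions for illustration, not the actual contents of k8s_gpu_setup.py):

# Illustrative sketch only; helper names, arguments, and defaults are assumed.
import shutil
import subprocess


def using_microk8s() -> bool:
    # Crude probe: if the microk8s snap is absent, so is its binary.
    # (The commits below describe a more careful probe via `microk8s status`.)
    return shutil.which("microk8s") is not None


def enable_intel_gpu(kustomization: str) -> None:
    # Intel GPU plugin manifests are applied directly with `kubectl apply -k`.
    subprocess.run(["kubectl", "apply", "-k", kustomization], check=True)


def enable_nvidia_gpu(on_microk8s: bool) -> None:
    # helm-based enablement; sketched alongside the commit notes further down.
    raise NotImplementedError


def setup_gpu(vendor: str, kustomization: str) -> None:
    if vendor == "intel":
        enable_intel_gpu(kustomization)
    elif vendor == "nvidia":
        enable_nvidia_gpu(on_microk8s=using_microk8s())
    else:
        raise ValueError(f"unsupported GPU vendor: {vendor}")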

The provider snap's minor version has been bumped to mark the release from which Canonical Kubernetes is supported.

Updates to the GitHub Workflows

  • Added a reusable workflow checkbox-dss-build.yaml to build the checkbox-dss snap.
  • The existing workflow to run the tests in Testflinger has been updated to:
    • Use the snap-building workflow to build the snap only once and pass it as an attachment in the Testflinger jobs created by the matrix.
    • Get rid of an explicit matrix of different snap channels, and accept them as inputs for workflow dispatch instead. (However, the matrix of different Testflinger queues is kept in place.)

Resolved issues

Documentation

Updated the README for this provider. No changes to main Checkbox documentation.

Tests

One of the machines has been failing to provision today, but the tests have passed on the other two machines using the updated workflow.

motjuste added 11 commits March 14, 2025 15:33
We will need helm when installing Canonical k8s to enable the NVIDIA GPU
operator in it.

Canonical k8s (and helm) will only be installed if the explicit argument
for the channel to use is provided.  Otherwise, the old default
behaviour of installing microk8s is maintained.
We use helm to add the relevant chart repository from NVIDIA and install the chart.
We re-use the existing script to verify the rollout too.
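
A minimal sketch of that helm flow (repository URL, release name, namespace, and the daemonset checked at the end are assumptions, not copied from the script):

# Sketch only: names and URLs below are assumptions.
import subprocess


def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)


run("helm", "repo", "add", "nvidia", "https://helm.ngc.nvidia.com/nvidia")
run("helm", "repo", "update")
run("helm", "install", "gpu-operator", "nvidia/gpu-operator",
    "--namespace", "gpu-operator", "--create-namespace")

# Re-use the existing verification step: wait for the operator's validator
# daemonset to roll out (the daemonset name here is an assumption).
run("kubectl", "-n", "gpu-operator", "rollout", "status",
    "daemonset/nvidia-operator-validator", "--timeout=10m")
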
This job needs to run after either one of the two jobs above it enabling
NVIDIA GPU in the k8s cluster succeeds (one is for microk8s, the other is
for Canonical k8s).  We can't list those jobs in 'depends' because then
both of them would have to succeed, which is impossible since only one
of either microk8s or Canonical k8s will be available.

The trick we use here is that we now `depends` on `dss/initialize`,
which must succeed for the whole test-plan to be run anyway, and
we require that an NVIDIA GPU is present.  This is similar to the `depends`
of the two jobs for microk8s and Canonical k8s.  We then have to
be careful that this job is added in the test-plan to ONLY run after
those two jobs.  The difference is that this job will not be
skipped if either of the two jobs enabling NVIDIA GPU fails.
We now have an addition to the `install-deps` script. It also marks the
point from which we started supporting Canonical K8s.
The "worker" daemonset that was being verified may have a version number
in its name, which we cannot predict.
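
In other words, the verification has to match the daemonset by a name prefix rather than an exact name; a rough sketch of such a lookup (namespace and prefix are illustrative):

# Sketch: find a daemonset whose name starts with a known prefix, since the
# full name may carry a version suffix we cannot predict.
import subprocess


def find_daemonset(namespace: str, prefix: str) -> str:
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "daemonsets",
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout
    for name in out.split():
        if name.startswith(prefix):
            return name
    raise RuntimeError(f"no daemonset starting with {prefix!r} in {namespace}")
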
@codecov

codecov bot commented Mar 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.67%. Comparing base (d600c6b) to head (3b033ed).
⚠️ Report is 134 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
==========================================
+ Coverage   50.44%   50.67%   +0.23%     
==========================================
  Files         382      384       +2     
  Lines       41026    41219     +193     
  Branches     6890     6890              
==========================================
+ Hits        20696    20889     +193     
  Misses      19585    19585              
  Partials      745      745              
Flag           Coverage Δ
provider-dss   100.00% <100.00%> (ø)

The NVIDIA GPU operator can be enabled in both microk8s and
Canonical k8s using helm, so we remove all the ad-hoc handling that
tried to distinguish whether microk8s or Canonical k8s was installed,
and just use the unified helm-based approach.

Helm now becomes a hard requirement.
@fernando79513
Collaborator

As discussed in:

@motjuste
Contributor Author

As discussed in:

@fernando79513 ... I was able to compress setting up K8s for NVIDIA and Intel GPUs, and moved them into a Python script (see this commit) ... Is this what you were expecting?

Personally, now that the setup is nicely compressed, I don't see too much value in wrapping it in a Python script.

Let me know if you still prefer this Python script, and what sort of unit-tests you believe it requires.

@motjuste motjuste marked this pull request as draft April 30, 2025 13:27
motjuste added 11 commits April 30, 2025 19:20
The labels take some time to propagate
If there's a TimeoutError, `microk8s status` was still executing, so it
is there ... even though it does not tell us whether microk8s is in use
or not.  Anyway, re-raise the error instead of deciding that there is no
microk8s.
The validator container may not have been created even after the
daemonset is rolled out (for some unknown reason), hence we wait before
checking the logs.  And, since checking the logs will wait for the
validations to succeed, this job may take considerably longer.
It is FileNotFoundError that is raised when microk8s is not installed,
so we are not going to try to catch all other CalledProcessErrors, for now.
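
A rough sketch of that detection logic (the timeout value is illustrative, and subprocess.TimeoutExpired stands in for the TimeoutError mentioned above; the real script may use a different timeout mechanism):

# Sketch of the microk8s probe described in the commit messages above.
import subprocess


def microk8s_is_installed() -> bool:
    try:
        subprocess.run(
            ["microk8s", "status", "--wait-ready"],
            check=True, capture_output=True, timeout=60,
        )
    except FileNotFoundError:
        # The microk8s binary is not on PATH, so the snap is not installed.
        return False
    except subprocess.TimeoutExpired:
        # `microk8s status` was still executing, so microk8s is there, even
        # though we cannot tell whether it is healthy; re-raise rather than
        # concluding that microk8s is absent.
        raise
    # Other CalledProcessErrors are deliberately not caught here.
    return True
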
we need snapd > 2.59 for SNAP_UID
@motjuste
Contributor Author

motjuste commented May 5, 2025

@fernando79513 ...

Sorry for the delay, but I got stuck in some weird behaviour of Helm-installing the Nvidia GPU operator (see my comment in the script).

Furthermore, I needed to add some special handling for installing the Nvidia operator on microk8s because it has a different setup for containerd. The script now automatically detects if microk8s is running, and appropriately configures the operator.
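
For reference, the containerd overrides that microk8s typically needs look roughly like this (the paths follow common microk8s guidance for the GPU operator and are an assumption about what the script passes, not a copy of it):

# Sketch: extra helm values pointing the NVIDIA container toolkit at
# microk8s' snap-confined containerd; paths are assumed, not taken from
# the actual script.
MICROK8S_TOOLKIT_ENV = [
    ("CONTAINERD_CONFIG", "/var/snap/microk8s/current/args/containerd-template.toml"),
    ("CONTAINERD_SOCKET", "/var/snap/microk8s/common/run/containerd.sock"),
    ("CONTAINERD_RUNTIME_CLASS", "nvidia"),
]


def microk8s_helm_overrides() -> list:
    # Rendered as repeated `--set toolkit.env[N].name/value=...` arguments
    # that get appended to the `helm install` command for the GPU operator.
    args = []
    for index, (name, value) in enumerate(MICROK8S_TOOLKIT_ENV):
        args += ["--set", f"toolkit.env[{index}].name={name}"]
        args += ["--set", f"toolkit.env[{index}].value={value}"]
    return args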

@motjuste motjuste marked this pull request as ready for review May 5, 2025 10:40
@motjuste
Contributor Author

motjuste commented May 22, 2025

Closing this without merging. To be picked up again as part of CHECKBOX-1898.

@motjuste motjuste closed this May 22, 2025
@motjuste motjuste deleted the CHECKBOX-1781-add-canonical-k8s branch August 21, 2025 11:11